ABSTRACT

In a previous paper by the second author, two Markov chain Monte Carlo perfect sampling 
algorithms — one called coupling from the past (CFTP) and the other (FMMR) based on rejection 
sampling — are compared using as a case study the move-to-front (MTF) self-organizing list 
chain. Here we revisit that case study and, in particular, exploit the dependence of FMMR on 
the user-chosen initial state. We give a stochastic monotonicity result for the running time of 
FMMR applied to MTF and thus identify the initial state that gives the stochastically smallest 
running time; by contrast, the initial state used in the previous study gives the stochastically 
largest running time. By changing from worst choice to best choice of initial state we achieve 
remarkable speedup of FMMR for MTF; for example, we reduce the running time (as measured 
in Markov chain steps) from exponential in the length n of the list nearly down to n when the 
items in the list are requested according to a geometric distribution. For this same example, the 
running time for CFTP grows exponentially in n. 
1 Introduction and summary

Perfect sampling has had a substantial impact on the world of Markov chain Monte
Carlo (MCMC). In MCMC, one is interested in obtaining a sample from a distribution $\pi$
from which it is computationally difficult (or even infeasible) to simulate directly. One
constructs a Markov chain whose stationary distribution is $\pi$ and, after running the chain
"a long time," takes an outcome from the chain as an (approximate) observation from $\pi$.



Propp and Wilson [?] (see also [13, 14]) and Fill [?] have devised algorithms to
use Markov chain transitions to produce observations exactly from $\pi$, without a priori
estimates of the mixing time of the chain; the applicability of the latter algorithm has
recently been extended by Fill, Machida, Murdoch, and Rosenthal [7], and so we will use
the terminology "FMMR algorithm." Although the two algorithms are based on different
ideas (Propp and Wilson use coupling from the past (CFTP), while FMMR is based
on rejection sampling), there is a simple connection between the two, discovered in [?]
and reviewed below. For further general discussion of perfect sampling using Markov
chains, consult the annotated bibliography maintained on the Web by Wilson [15].

Much of the discussion comparing the two algorithms has focused on the issue of
"interruptibility." FMMR has the feature that the output and the running time (when
measured in number of Markov chain steps) are independent random variables. Thus,
for instance, an impatient user who interrupts a run of the algorithm after any fixed
number of steps and restarts the procedure does not introduce any bias into the output.
This is not so for CFTP. On the other hand, for many practical applications CFTP
is considerably easier to implement, since (see [?]) FMMR requires the user to be able
(i) to generate a trajectory from the time-reversal of the basic chain, and (ii) to build
couplings "ex post facto," i.e., to perform certain imputation steps; CFTP requires
neither ability.

Remark 1.1. There is a need for time-reversal generation (of an auxiliary chain) and
for ex post facto coupling in an extension of CFTP known as coupling into and from
the past, introduced (under a different name) by Kendall [?]. (See also Section 1.9.3
in [?].)



In this paper we focus on the running time of the two algorithms (but the
non-interruptibility of CFTP will turn out to play a key role). In previous case-study
comparisons [?] the running times (and memory requirements) have been found to
be not hugely different, but CFTP has had the edge. In this paper, by revisiting the
case study of [?], we show that, at least in some cases, FMMR can be made to have
much smaller running time than CFTP.

The general observation that we exploit, one very closely related to Remark 6.9(c)
and Section 8.2 of [?], is the following. Given a target distribution $\pi$, let $p_{\mathrm{CFTP}}$
denote the probability that CFTP terminates successfully (coalesces) over a fixed time
window (and outputs a sample from $\pi$). Similarly, let $p_{\mathrm{CFTP}}(z)$ denote the conditional
probability of coalescence over the time window, given that the state (call it $Z_{\mathrm{CFTP}}$)
ultimately output by CFTP (after extending the time window into the indefinite past)
is $z$. Let $p_{\mathrm{FMMR}}(z)$ denote the conditional probability that FMMR terminates
successfully over the same time window, given that the initial state (call it $Z_{\mathrm{FMMR}}$) is $z$.
Then, as we show in Theorem 2.3, $p_{\mathrm{CFTP}}(z) = p_{\mathrm{FMMR}}(z)$. That is (now letting the time
window vary), if $T_{\mathrm{CFTP}}$ and $T_{\mathrm{FMMR}}$ denote the respective running times of CFTP and
FMMR, then conditional running time distributions agree:

    $\mathcal{L}(T_{\mathrm{CFTP}} \mid Z_{\mathrm{CFTP}} = z) = \mathcal{L}(T_{\mathrm{FMMR}} \mid Z_{\mathrm{FMMR}} = z)$,

where $\mathcal{L}(X)$ denotes the distribution (law) of the random variable $X$. As a consequence,
$p_{\mathrm{CFTP}} = \mathbf{E}_\pi[p_{\mathrm{FMMR}}(Z_{\mathrm{FMMR}})]$; that is, $\mathcal{L}(T_{\mathrm{CFTP}})$ is the $\pi$-mixture of the distributions
$\mathcal{L}(T_{\mathrm{FMMR}} \mid Z_{\mathrm{FMMR}} = z)$.
The important point here is that, except in the rare instance that CFTP is
interruptible for the chain of interest (i.e., that $T_{\mathrm{CFTP}}$ and $Z_{\mathrm{CFTP}}$ are independent),
for at least one time window there must exist at least one initial state $z$ for which
$p_{\mathrm{FMMR}}(z) > p_{\mathrm{CFTP}}$.

The move-to-front (MTF) process is a nonreversible Markov chain on the permutation
group $S_n$. The two algorithms have been compared for MTF in a previous paper [?].
In that paper, the initial state for FMMR was taken to be the identity permutation, and
it was then found, roughly speaking (see Table 1 and Section 5 therein), that $T_{\mathrm{CFTP}}$
and $T_{\mathrm{FMMR}}$ are of the same size. In this paper, we will revisit that case study and
establish a stochastic monotonicity result for $\mathcal{L}(T_{\mathrm{CFTP}} \mid Z_{\mathrm{CFTP}} = z)$ in $z$. It turns out,
in particular, that the identity permutation is the worst choice of initial state! When
we choose instead the reversal permutation, which is the best choice, we obtain a
(sometimes huge) speedup for FMMR. (See Table 1, which will be explained more fully in
Section 4. Notice that for geometric weights, the change in starting state reduces $T_{\mathrm{FMMR}}$
from exponential in $n$ to about $n$.) The gains obtained by using the optimal $z$ are
sufficiently dramatic that, when measured in Markov chain steps, the resulting worst-case
running time for FMMR (worst over choice of request weights) equals the best-case
running time for CFTP: see Remark 4.2(b).

We temper our enthusiasm, however, by recognizing that our MTF example is somewhat
artificial on two counts. Firstly, as discussed in the introduction to [?], there are
algorithms for sampling from the MTF stationary distribution which are both more 
elementary (in particular, not involving Markov chains) and more efficient than either 
CFTP or FMMR. So we do not recommend applying either CFTP or FMMR to MTF in 
practice. Our goal here is to illustrate how judicious choice of starting state for FMMR 
can greatly improve its performance. 

Secondly, MTF has the (evidently rare) property that one can obtain an exact 
analysis of the running time distribution for FMMR for every choice of initial state z. 
We do not yet know whether our speedup ideas help in any real applications. We 
hope, however, that the ideas in this paper will stimulate further research on FMMR 
by pointing to the possibility of speedup of the algorithm. 

We briefly review the two perfect sampling algorithms and their general connection
in Section 2. The move-to-front rule is reviewed in Section 3. Our new results are
given in Section 4. A somewhat different approach to speeding up FMMR is given in
Section 5.
2 Perfect sampling

We briefly review the CFTP and FMMR algorithms (omitting a few of the finer
measure-theoretic details, which are irrelevant anyway for finite-state chains). We assume that
our Markov chain $X$ can be written in the stochastic recursive sequence form

(2.1)    $X_s = \phi(X_{s-1}, U_s)$,

where $\phi$ is called the transition rule and $(U_s)$ is an i.i.d. sequence. We further assume
that our Markov chain has finite state space $\mathcal{X}$ and is irreducible and aperiodic with
(unique) stationary distribution $\pi$.



2.1 CFTP 

For a fixed positive integer $t$ and a Markov chain with $n$ states, start $n$ copies of the
chain at time $-t$, one from each of the $n$ states, coupling the transitions by means of the
transition rule $\phi$, and run the chains until time 0. If all copies of the chain agree
at time 0, we say that the trajectories have coalesced and return the common value,
say $Z$. If the chains do not agree, then increment $t$ and restart the procedure, using for
common values of $s$ the same values of $U_s$ used in the previous step; again, check for
coalescence. The running time of the algorithm we define to be the smallest integer $t$
for which coalescence occurs. If we assume the algorithm terminates with probability 1,
then $Z \sim \pi$ exactly.
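
The vanilla procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not an optimized implementation: the function name `cftp`, the helper names `phi3` and `draw_u3`, the value `EPS = 0.3`, and the small three-state toy chain used to exercise the routine are all assumptions introduced here.

```python
import random

def cftp(states, phi, draw_u, rng):
    """Vanilla coupling from the past for a finite-state chain.

    states : list of all states of the chain
    phi    : transition rule, phi(x, u) -> next state
    draw_u : function taking rng and returning one innovation U_s
    Returns (Z, t): the common value at time 0 and the running time t.
    """
    us = []  # us[k] stores U_{-k}: us[0] = U_0, us[1] = U_{-1}, ...
    t = 0
    while True:
        t += 1
        us.append(draw_u(rng))       # new innovation U_{-t+1}; older ones are reused
        current = set(states)        # one copy of the chain per state at time -t
        for s in range(-t + 1, 1):   # apply U_{-t+1}, ..., U_0 in forward order
            current = {phi(x, us[-s]) for x in current}
        if len(current) == 1:        # all trajectories agree at time 0
            return next(iter(current)), t

# Hypothetical toy chain on {0, 1, 2}: u = 0 sends every state to 0;
# otherwise state 0 moves to u and states 1 and 2 stay put.
EPS = 0.3

def phi3(x, u):
    if u == 0:
        return 0
    return u if x == 0 else x

def draw_u3(rng):
    r = rng.random()
    if r < EPS:
        return 0
    return 1 if r < EPS + (1 - EPS) / 2 else 2

if __name__ == "__main__":
    print(cftp([0, 1, 2], phi3, draw_u3, random.Random(1)))
```

Recomputing every trajectory from scratch at each $t$ keeps the sketch close to the description above; it costs on the order of $t^2$ transitions per run, which the doubling trick mentioned in Remark 2.1 would reduce.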

There is a rich source of papers, primers, and applications of CFTP. The best initial
reference is the "Perfectly random sampling with Markov chains" Web site maintained
by David Wilson at http://www.dbwilson.com/exact/



2.2 FMMR 

Given a Markov chain with transition matrix $K$, recall that the time-reversal chain has
transition matrix $\tilde{K}$ which satisfies

    $\pi(x) \tilde{K}(x, y) = \pi(y) K(y, x)$ for all $x, y$.

The FMMR algorithm has two stages. First, choose an initial state $X_0$ and run the
time-reversed chain $\tilde{K}$, obtaining $X_0, X_{-1}, \ldots$ in succession. Then (conditionally given
the $X$-values) generate $U_0, U_{-1}, \ldots$ independently, with $U_s$ chosen from its conditional
distribution given (2.1) for $s = 0, -1, \ldots$. (One says that the values $U_s$ are imputed.)
For $t = 0, 1, \ldots$, and for each state $x$ in the state space, set $Y^{(-t)}_{-t}(x) := x$ and,
inductively,

    $Y^{(-t)}_s(x) := \phi(Y^{(-t)}_{s-1}(x), U_s)$,    $-t < s \le 0$.

We will sometimes refer to the realization of the chain $X$ as the backward trajectory,
and to the realizations of the chains $Y(x)$ as the forward trajectories. The running time
of the algorithm we define to be the smallest $t$ such that the values $Y^{(-t)}_0(x)$ agree for every $x$
in the state space (and hence all equal $X_0$). In this case the algorithm reports $X_{-t}$ as
an observation from $\pi$.
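
A minimal Python sketch of these two stages, for a finite-state chain whose innovations take finitely many values, might look as follows. It assumes that the stationary probabilities are available as a dictionary `pi` (so that the time-reversed kernel can be evaluated directly), which is a convenience for illustration and not a requirement of the algorithm; the names `fmmr`, `phi3`, `U_PROBS`, `PI`, and the three-state toy chain are likewise hypothetical.

```python
import random

def fmmr(states, phi, u_probs, pi, z, rng):
    """Vanilla FMMR started at X_0 = z; returns (X_{-t}, t).

    u_probs : dict u -> P(U = u)
    pi      : dict state -> stationary probability (assumed known here)
    """
    def k(x, y):
        # One-step transition probability K(x, y) induced by phi.
        return sum(p for u, p in u_probs.items() if phi(x, u) == y)

    xs = [z]   # xs[j] = X_{-j} (backward trajectory)
    us = []    # us[j] = U_{-j} (imputed innovations)
    while True:
        cur = xs[-1]
        # Time-reversed step: K~(cur, y) is proportional to pi(y) * K(y, cur).
        weights = [pi[y] * k(y, cur) for y in states]
        prev = rng.choices(states, weights=weights)[0]
        # Impute U for the forward transition prev -> cur from its
        # conditional distribution given phi(prev, U) = cur.
        cand = [u for u in u_probs if phi(prev, u) == cur]
        u = rng.choices(cand, weights=[u_probs[c] for c in cand])[0]
        xs.append(prev)
        us.append(u)
        t = len(us)
        # Forward phase: run all states from time -t using U_{-t+1}, ..., U_0.
        ends = set(states)
        for j in range(t - 1, -1, -1):
            ends = {phi(x, us[j]) for x in ends}
        if len(ends) == 1:   # coalescence (necessarily to X_0 = z)
            return xs[t], t

# Hypothetical toy chain on {0, 1, 2} with eps = 0.3.
def phi3(x, u):
    if u == 0:
        return 0
    return u if x == 0 else x

U_PROBS = {0: 0.3, 1: 0.35, 2: 0.35}
PI = {0: 0.3, 1: 0.35, 2: 0.35}

if __name__ == "__main__":
    print(fmmr([0, 1, 2], phi3, U_PROBS, PI, 0, random.Random(7)))
```

For this toy chain a run started at state 0 always stops after one step, because any transition into 0 forces the imputed innovation to be 0, which sends every state to 0.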
Remark 2.1. The algorithms are presented here in their most general, "vanilla" versions.
A large amount of research has gone into improving both algorithms and tailoring
them for specific applications. In particular, to improve performance a "doubling trick" 
is suggested for both algorithms whereby instead of incrementing t by one at each step, 
t is successively doubled. Since this affects the number of Markov chain steps taken only 
by constant factors, we shall for our theoretical analysis stick to the "vanilla" versions. 

Remark 2.2. For most chains of interest, the state space is very large and the imple- 
mentations presented here (running copies of the chain from every state in the state 
space) are not feasible. However, for a large class of cases where a form of monotonicity 
holds, the algorithms become practical. 

Given a Markov chain with transition matrix $K$, we say that we are in the (realizably)
monotone case if the following conditions hold. The state space is a partially ordered
set $(\mathcal{X}, \le)$. There exist (necessarily unique) minimum and maximum elements in the
state space, denoted $\hat{0}$ and $\hat{1}$, respectively. There exists a monotone transition rule $\phi$
for the chain. Such a rule is a function $\phi : \mathcal{X} \times \mathcal{U} \to \mathcal{X}$ together with a random variable
$U$ taking values in a probability space $\mathcal{U}$ such that (i) $\phi(x, u) \le \phi(y, u)$ for all $u \in \mathcal{U}$
whenever $x \le y$; and (ii) for each $x \in \mathcal{X}$, $P(\phi(x, U) = y) = K(x, y)$ for all $y \in \mathcal{X}$.

When in the monotone case, for CFTP one only needs to follow two trajectories of
the chain, one started at time $-t$ from $\hat{0}$ and the other from $\hat{1}$, since all other trajectories
are sandwiched between these. Likewise, in the second phase of FMMR, one only needs
to run the $Y$-chain from states $\hat{0}$ and $\hat{1}$.
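
To illustrate the saving in the monotone case, here is a small Python sketch of CFTP that follows only the two extreme trajectories. The example chain (a reflecting random walk on $\{0, \ldots, M\}$ with the usual order, whose stationary distribution is uniform when up and down steps are equally likely) and all names (`monotone_cftp`, `phi_walk`, `draw_step`, `M`) are assumptions chosen for illustration.

```python
import random

def monotone_cftp(bottom, top, phi, draw_u, rng):
    """Monotone CFTP: track only the trajectories from the minimum and maximum states."""
    us = []  # us[k] = U_{-k}
    while True:
        us.append(draw_u(rng))      # extend the window one step into the past
        lo, hi = bottom, top
        for u in reversed(us):      # apply U_{-t+1}, ..., U_0 in forward order
            lo, hi = phi(lo, u), phi(hi, u)
        if lo == hi:                # sandwiching forces every trajectory to agree
            return lo, len(us)

M = 3  # hypothetical toy example: reflecting walk on {0, ..., M}

def phi_walk(x, u):
    # Monotone rule: x <= y implies phi_walk(x, u) <= phi_walk(y, u).
    return min(max(x + u, 0), M)

def draw_step(rng):
    return 1 if rng.random() < 0.5 else -1

if __name__ == "__main__":
    print(monotone_cftp(0, M, phi_walk, draw_step, random.Random(3)))
```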

Although the two algorithms are based on different conceptual underpinnings, our
first theorem highlights an important connection between them. Roughly, the
distribution of the running time for CFTP is equal to the stationary mixture, over initial states,
of the distributions of the running time for FMMR. This is given as Remark 6.9(c) in [?],
but we wish to emphasize its importance and so recast it as a theorem. We recall our
notation from Section 1. For a fixed time window, let $p_{\mathrm{CFTP}}(z)$ denote the probability
that CFTP coalesces given that the state (call it $Z_{\mathrm{CFTP}}$) ultimately output by CFTP
is $z$, and let $p_{\mathrm{CFTP}}$ denote the corresponding unconditional probability. Let $p_{\mathrm{FMMR}}(z)$
denote the conditional probability that FMMR coalesces given that the initial state (call
it $Z_{\mathrm{FMMR}}$) is $z$. Let $T_{\mathrm{CFTP}}$ and $T_{\mathrm{FMMR}}$ denote the respective running times of CFTP
and FMMR (now letting the time window vary).

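Theorem 2.3. We have

(2.2)    $p_{\mathrm{CFTP}}(z) = p_{\mathrm{FMMR}}(z)$ for $\pi$-almost every $z$;

(2.3)    $p_{\mathrm{CFTP}} = \mathbf{E}_\pi[p_{\mathrm{FMMR}}(Z_{\mathrm{FMMR}})]$;

(2.4)    $\mathcal{L}(T_{\mathrm{CFTP}} \mid Z_{\mathrm{CFTP}} = z) = \mathcal{L}(T_{\mathrm{FMMR}} \mid Z_{\mathrm{FMMR}} = z)$ for $\pi$-almost every $z$;

and

(2.5)    $\mathcal{L}(T_{\mathrm{CFTP}})$ is the $\pi$-mixture (over $z$) of $\mathcal{L}(T_{\mathrm{FMMR}} \mid Z_{\mathrm{FMMR}} = z)$.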
The result holds in the most general setting, not restricted either to finite-state chains
or to monotone transition rules. It is a consequence of the discussion in Sections 6.2
and 8.2 of [?]. For the reader's convenience we give here a simple proof for the discrete
case.

Proof. Each iteration of FMMR is an implementation of rejection sampling (see, e.g.,
Devroye [?] for background). The goal is to use an observation from $\tilde{K}^t(z, \cdot)$ to
simulate one from $\pi$. One obtains an upper bound $c$ on $\max_x \pi(x)/\tilde{K}^t(z, x)$, generates $x$
with probability $\tilde{K}^t(z, x)$, and accepts $x$ as an observation from $\pi$ with probability
$c^{-1} \pi(x)/\tilde{K}^t(z, x)$. The unconditional probability of acceptance is then $1/c$. Observe
that, for every $x$,

    $\frac{\pi(x)}{\tilde{K}^t(z, x)} = \frac{\pi(z)}{K^t(x, z)} \le \frac{\pi(z)}{P(\text{coalescence to } z)}$,

where "coalescence to $z$" refers, of course, to coalescence over the given time window of
length $t$. Thus for the desired conditional acceptance probability given $x$ we can use

    $\frac{P(\text{coalescence to } z)}{K^t(x, z)} = P(\text{coalescence to } z \mid \text{trajectory from } x \text{ ends at } z)$,

and the FMMR algorithm is designed precisely to implement this. Thus $p_{\mathrm{FMMR}}(z) = 1/c$
and hence

(2.6)    $p_{\mathrm{FMMR}}(z) = \frac{P(\text{coalescence to } z)}{\pi(z)} = p_{\mathrm{CFTP}}(z)$.

Thus (2.2) is immediate. Taking expectations with respect to $\pi$ gives (2.3). And (2.4)
[from which (2.5) is immediate] follows from (2.2) since, for a fixed time window of
length $t$, $p_{\mathrm{FMMR}}(z)$ [respectively, $p_{\mathrm{CFTP}}(z)$] is the value at $t$ of the conditional
distribution function of $T_{\mathrm{FMMR}}$ given that $Z_{\mathrm{FMMR}} = z$ (respectively, of $T_{\mathrm{CFTP}}$ given that
$Z_{\mathrm{CFTP}} = z$). □



Corollary 2.4. If $T_{\mathrm{CFTP}}$ and $Z_{\mathrm{CFTP}}$ are not independent random variables, then there
exist at least one time window and at least one initial state $z$ for which
$p_{\mathrm{FMMR}}(z) > p_{\mathrm{CFTP}}$.

Proof. This is immediate from Theorem 2.3. □

The following simple examples are artificial, but they give a first demonstration 
that judicious choice of starting state can lead to dramatic speedup. First, consider a 
three-state Markov chain with states labeled 0, 1, and 2. Let

    $K = \begin{pmatrix} \epsilon & (1-\epsilon)/2 & (1-\epsilon)/2 \\ \epsilon & 1-\epsilon & 0 \\ \epsilon & 0 & 1-\epsilon \end{pmatrix}$,

where $\epsilon > 0$ is small. One checks that this chain is reversible (but not monotone).
Let $U = 0, 1, 2$ with respective probabilities $\epsilon$, $(1-\epsilon)/2$, $(1-\epsilon)/2$ and use the natural
transition rule

    $\phi(x, 0) = 0$ for all $x$,    $\phi(0, 1) = \phi(1, 1) = \phi(1, 2) = 1$,    $\phi(0, 2) = \phi(2, 1) = \phi(2, 2) = 2$.

Coalescence occurs over a given time window of length $t$ if and only if the value of
some $U_s$ in that window is 0; thus $p_{\mathrm{CFTP}} = 1 - (1 - \epsilon)^t$, which requires $t$ of order $1/\epsilon$
to become nonnegligible. On the other hand, if FMMR is started in state 0, then with
high probability ($= 1 - \epsilon$) we'll see (going backward in time) one of the transitions
$1 \leftarrow 0$ or $2 \leftarrow 0$. If we do, then (whichever we see) in the forward phase we impute
$U = 0$ and hence get coalescence (to state 0) in one step.
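
For a chain this small, the success probabilities can be checked by brute force. The following Python sketch enumerates every innovation sequence over a window of length $t$, accumulates the probability of coalescence to each state, and then applies the ratio "probability of coalescence to $z$, divided by $\pi(z)$" to obtain the FMMR success probability from each initial state. The function names and the choice of parameters are illustrative assumptions; the stationary distribution $(\epsilon, (1-\epsilon)/2, (1-\epsilon)/2)$ used in the code follows from the fact that every state moves to 0 with probability $\epsilon$ and from the symmetry between states 1 and 2.

```python
from itertools import product

def phi(x, u):
    # Transition rule of the three-state example.
    if u == 0:
        return 0
    return u if x == 0 else x

def coalescence_probs(t, eps):
    """Return {z: P(all three trajectories coalesce to z over a window of length t)}."""
    u_probs = {0: eps, 1: (1 - eps) / 2, 2: (1 - eps) / 2}
    out = {0: 0.0, 1: 0.0, 2: 0.0}
    for seq in product(u_probs, repeat=t):   # innovations in forward time order
        ends, prob = {0, 1, 2}, 1.0
        for u in seq:
            ends = {phi(x, u) for x in ends}
            prob *= u_probs[u]
        if len(ends) == 1:
            out[next(iter(ends))] += prob
    return out

def success_probs(t, eps):
    """Return (p_CFTP, {z: p_FMMR(z)}) for a window of length t."""
    pi = {0: eps, 1: (1 - eps) / 2, 2: (1 - eps) / 2}
    coal = coalescence_probs(t, eps)
    p_cftp = sum(coal.values())
    p_fmmr = {z: coal[z] / pi[z] for z in pi}
    return p_cftp, p_fmmr

if __name__ == "__main__":
    for t in (1, 2, 5):
        print(t, success_probs(t, 0.1))
```

Under these assumptions the enumeration gives an FMMR success probability of 1 from state 0 for every $t \ge 1$ (coalescence to 0 happens exactly when the last innovation is 0, an event of probability $\epsilon = \pi(0)$), while the CFTP success probability is $1 - (1 - \epsilon)^t$.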

For our second example, consider a Gibbs sampler on an attractive spin system
with $n$ sites arranged in a row and left-to-right site-update sweeps. (Consult, e.g., [?]
or [?] for background on attractive spin systems.) This gives a monotone, nonreversible
chain where $\hat{0}$ is the state consisting of all $-$'s and $\hat{1}$ is the state of all $+$'s. Suppose that
the Gibbs distribution is such that there is (i) a strong external field for spin $+$ at sites 1
through $n - 1$, (ii) a much stronger effect of attractiveness throughout the system, and
(iii) a very much stronger yet external field for spin $+$ at site $n$ (the rightmost site).
First consider CFTP. The state $\hat{1}$ is a state of very high probability and so the chain
won't budge out of that state for a long time. On the other hand, from $\hat{0}$, in one sweep
(a full left-to-right update), we obtain [with high probability, because of (ii) and (iii)]
$- - \cdots - - +$. In the next sweep we obtain [with high probability, because of (ii) and (i)]
$- - \cdots - + +$. Continuing, in about $n$ sweeps we obtain $+ + \cdots + +$; that is, with high
probability we obtain coalescence in $n$ sweeps. On the other hand, consider FMMR
started in $\hat{0}$. Here, the reversed chain is Gibbs sampling with right-to-left updates. The
reversed chain, started in $\hat{0}$, [with high probability, because of (iii) and then (ii)] flips
each site from $-$ to $+$ as it moves from right to left. Hence, we obtain $+ + \cdots + +$;
that is, with high probability there is coalescence in one sweep.

(Of course, were we to use right-to-left sweeps or reversible sweeps as the sampler, 
the relative disadvantage of CFTP would disappear.) 

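Remark 2.5. In general, we know of no simpler expression for $p_{\mathrm{FMMR}}(z)$ than the ratio
in (2.6). In the monotone case, however, when $z = \hat{0}$ or $z = \hat{1}$ we obtain significant
simplification. Indeed, then

    $p_{\mathrm{FMMR}}(\hat{0}) = \frac{K^t(\hat{1}, \hat{0})}{\pi(\hat{0})} = \min_x \frac{K^t(x, \hat{0})}{\pi(\hat{0})}$    and    $p_{\mathrm{FMMR}}(\hat{1}) = \frac{K^t(\hat{0}, \hat{1})}{\pi(\hat{1})} = \min_x \frac{K^t(x, \hat{1})}{\pi(\hat{1})}$.

Recall that for a Markov chain with transition matrix $K$ and stationary distribution $\pi$,
the separation at time $t$ given that the chain starts in state $x$ is

    $\mathrm{sep}_x(t) := 1 - \min_z \frac{K^t(x, z)}{\pi(z)}$.

Thus, $p_{\mathrm{FMMR}}(z) = 1 - \widetilde{\mathrm{sep}}_z(t)$ for $z = \hat{0}, \hat{1}$, where $\widetilde{\mathrm{sep}}$ refers to separation for the
transition matrix $\tilde{K}$. See (e.g.) [?] for more on separation.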
3 Move-to-front

Let $(w_1, \ldots, w_n)$ be a probability mass function on $\{1, \ldots, n\}$ with $w_i > 0$ for each $i$.
In this study we are concerned with generating an observation from the distribution

(3.1)    $\pi(z) := \prod_{r=1}^{n} \frac{w_{z_r}}{\sum_{j=r}^{n} w_{z_j}}$,    $z \in S_n$,

where $S_n$ is the group of permutations of $\{1, \ldots, n\}$. Consider sampling without
replacement from a population of $n$ items, where item $i$ has probability $w_i$ of being chosen,
$1 \le i \le n$. Then the probability of drawing the $n$ items in the order $z$ is given by (3.1).
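
As a small numerical illustration of the product formula for the probability of drawing the items in a given order, the following Python sketch evaluates it for every permutation of a three-item list. The function name `mtf_stationary` and the example weights are assumptions made for illustration; items are labeled $1, \ldots, n$ and `w[i - 1]` holds the weight of item $i$.

```python
from itertools import permutations

def mtf_stationary(z, w):
    """Probability of drawing the items in the order z when sampling without replacement.

    Each factor is the weight of the item drawn divided by the total
    weight of the items not yet drawn.
    """
    prob = 1.0
    remaining = sum(w[i - 1] for i in z)   # equals 1 for a full permutation
    for item in z:
        prob *= w[item - 1] / remaining
        remaining -= w[item - 1]
    return prob

if __name__ == "__main__":
    w = (0.5, 0.3, 0.2)                    # hypothetical example weights
    for z in permutations((1, 2, 3)):
        print(z, round(mtf_stationary(z, w), 6))
```

With these weights the order (1, 2, 3) has probability $0.5 \cdot (0.3/0.5) \cdot 1 = 0.3$, and the six probabilities sum to 1.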

This distribution arises as the limiting distribution of the much-studied move-to-front
(MTF) process. The MTF heuristic is used to "self-organize" a linear list of data
records in a computer file. Let $\{1, \ldots, n\}$ be a set of records (or rather the set of keys, or
identifying labels for the records), where record $i$ has probability $w_i$ of being requested.
At discrete units of time, and independent of past requests, item $i$ is requested (with
probability $w_i$) and brought to the front of the list, leaving the relative order of the
other records unchanged. The successive orders of the list of records form an ergodic
Markov chain on the permutation group $S_n$ with stationary distribution $\pi$.

We will assume that the records have been labeled so that $w_1 \ge \cdots \ge w_n > 0$ and
refer to $w = (w_1, \ldots, w_n)$ as the weight vector of the chain. For extensive treatment of
MTF, see [?], which contains pointers to the sizable literature on the subject. Hendricks
[?] was the first to show that the stationary distribution of the MTF Markov chain is
given by (3.1).



Fill [?] used MTF as a case study to compare CFTP and FMMR. We omit many
details but for completeness describe the set-up briefly. Partially order the symmetric
group $S_n$ by declaring $z \le z'$ if $z'$ can be obtained from $z$ by a sequence of adjacent
transpositions which switch records out of order (that begin in natural order). This is
the weak Bruhat order. With

    $\hat{0} := \mathrm{id} = (1, 2, \ldots, n)$,    $\hat{1} := \mathrm{rev} = (n, n - 1, \ldots, 1)$,

we have $\hat{0} \le z \le \hat{1}$ for all $z \in S_n$. (For the definition of the Bruhat order, used
later, delete the word "adjacent.") The MTF chain possesses the following monotone
transition rule with respect to the weak Bruhat order. Let $U$ be a random variable
satisfying $P(U = i) = w_i$ for $1 \le i \le n$. Define

    $\phi(z, i) = \mathrm{move}_i(z)$ for $z \in S_n$ and $1 \le i \le n$,

where $\mathrm{move}_i(z)$ is defined to be the permutation resulting from the list $z$ by requesting
record $i$ and applying the MTF rule. It is easily checked (see Lemma 2.2 in [?]) that
this gives a monotone transition rule for MTF.
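
The MTF transition rule is simple to code, and for a short list one can verify numerically that the product-form distribution is preserved by one MTF step. The following Python sketch does this by pushing the distribution forward through every (list, request) pair; the names `move`, `pi_mtf`, and `one_step_pushforward`, and the example weights in the demo, are illustrative assumptions, and the weights are assumed to sum to 1.

```python
from itertools import permutations

def move(z, i):
    """Request record i and move it to the front of the list z."""
    return (i,) + tuple(r for r in z if r != i)

def pi_mtf(z, w):
    """Product-form probability of the list order z (weights w sum to 1)."""
    prob, remaining = 1.0, 1.0
    for item in z:
        prob *= w[item - 1] / remaining
        remaining -= w[item - 1]
    return prob

def one_step_pushforward(w):
    """Distribution of the list after one MTF step when it starts in pi_mtf."""
    n = len(w)
    new = {}
    for z in permutations(range(1, n + 1)):
        for i in range(1, n + 1):
            target = move(z, i)
            new[target] = new.get(target, 0.0) + pi_mtf(z, w) * w[i - 1]
    return new

if __name__ == "__main__":
    w = (0.5, 0.3, 0.2)   # hypothetical example weights
    new = one_step_pushforward(w)
    for z, p in sorted(new.items()):
        print(z, round(p, 6), round(pi_mtf(z, w), 6))
```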

MTF, of course, is not a reversible Markov chain; however, it is relatively
straightforward to generate transitions from the time-reversed chain. We refer the reader to [?]
for further details on implementing MTF both using CFTP and using FMMR.
Our first result (Theorem 3.1) exhibits explicitly the dependence of $T_{\mathrm{FMMR}}$ on the
initial state $Z_{\mathrm{FMMR}}$. In what follows, given $z \in S_n$, let $y_r := w_{z_r}$ for $1 \le r \le n$. In this
notation, (3.1) can be written in the form

    $\pi(z) = \prod_{r=1}^{n} \frac{y_r}{1 - y^+_{r-1}}$,

where we have also introduced the notation

    $y^+_r := \sum_{j=1}^{r} y_j$,    $0 \le r \le n$,

for any vector $(y_1, \ldots, y_n)$.

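Theorem 3.1. (a) The conditional distribution of $T_{\mathrm{FMMR}}$ given the initial state $Z_{\mathrm{FMMR}}$
satisfies

    $\mathcal{L}(T_{\mathrm{FMMR}} \mid Z_{\mathrm{FMMR}} = z) = \mathcal{L}(T_z)$,

where the law of $T_z$ is the convolution of Geometric$(1 - y^+_r)$ distributions, $0 \le r \le n - 2$.
We write

    $T_z \sim \bigoplus_{r=0}^{n-2} \mathrm{Geom}(1 - y^+_r)$.

(b) The random variables $T_z$ decrease stochastically in the Bruhat order for $z$.

(c) The distribution $\mathcal{L}(T_z)$ is stochastically minimized (respectively, maximized) by
choosing $z = \mathrm{rev}$ (resp., $z = \mathrm{id}$). In that case we find

(3.2)    $T_{\mathrm{rev}} \sim \bigoplus_{r=2}^{n} \mathrm{Geom}(w^+_r)$    [resp., $T_{\mathrm{id}} \sim \bigoplus_{r=0}^{n-2} \mathrm{Geom}(1 - w^+_r)$].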

Proof. Part (a) is a consequence of (2.4) in our Theorem 2.3 and Lemma 3.7 in [?];
indeed, that lemma states that $\mathcal{L}(T_{\mathrm{CFTP}} \mid Z_{\mathrm{CFTP}} = z) = \bigoplus_{r=0}^{n-2} \mathrm{Geom}(1 - y^+_r)$. For the
weak Bruhat order, Lemma 3.9 in [?] gives part (b); but one need only compare $\mathcal{L}(T_z)$
and $\mathcal{L}(T_{z'})$ when $z$ and $z'$ differ by any transposition to see that part (b) holds for the
Bruhat order. Part (c) is an immediate consequence of part (b). □
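
Since a Geometric($p$) random variable has mean $1/p$, the convolution in Theorem 3.1(a) gives an explicit expected running time for each initial state: the sum over $r = 0, \ldots, n-2$ of the reciprocal of one minus the total weight of the first $r$ records in the list. The Python sketch below evaluates this for the identity and reversal permutations under geometric weights; the function names and the parameter choices ($n = 10$, ratio $1/2$) are illustrative assumptions.

```python
from itertools import permutations

def expected_fmmr_time(z, w):
    """Expected FMMR running time from initial state z (items labeled 1..n)."""
    total, partial = 0.0, 0.0          # partial = weight of the first r records of z
    for r in range(len(z) - 1):        # r = 0, ..., n-2
        total += 1.0 / (1.0 - partial)
        partial += w[z[r] - 1]
    return total

def geometric_weights(n, theta):
    """Weights (1 - theta) * theta**(i-1) for i < n, with the last weight theta**(n-1)."""
    w = [(1 - theta) * theta ** (i - 1) for i in range(1, n)]
    w.append(theta ** (n - 1))
    return w

if __name__ == "__main__":
    n = 10
    w = geometric_weights(n, 0.5)
    ident = tuple(range(1, n + 1))
    rev = ident[::-1]
    print("identity:", expected_fmmr_time(ident, w))
    print("reversal:", expected_fmmr_time(rev, w))
```

In this example the identity permutation gives an expected running time of $2^9 - 1 = 511$ steps, while the reversal permutation gives fewer than 10, in line with parts (b) and (c).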



Remark 3.2. Theorem 3.1 for the special case of MTF belies the general Remark 2.5.
Indeed, for every initial state for FMMR, we know exactly the distribution of $T_{\mathrm{FMMR}}$.
The worst starting state for coalescence is the identity permutation, and the result for
$\mathcal{L}(T_{\mathrm{id}})$ in Theorem 3.1(c) recaptures Theorem 4.2 in [?]. The comparison of FMMR and
CFTP in [?] was based on starting FMMR in this worst state. In the next section we
will discuss how much speedup can be achieved by instead starting in the best state,
the permutation rev.

4 Speedup results for MTF 
4.1 General weight vectors 

From now on, we abbreviate $T_{\mathrm{rev}}$ of (3.2) as $T$. We first consider how $\mathcal{L}(T)$ varies with
the weight vector $w$. For an understanding of the terminology used in Theorem 4.1 and
a thorough treatment of majorization, see [?].
Theorem 4.1. The distribution $\bigoplus_{r=2}^{n} \mathrm{Geom}(w^+_r)$ of $T$ is stochastically strictly
Schur-concave in the weight vector $w$. In particular, the distribution is stochastically
maximized (respectively, minimized), over all vectors $w$ with $w_1 \ge w_2 \ge \cdots \ge w_n \ge 0$, at the
uniform distribution $w = (1/n, \ldots, 1/n)$ (resp., at any distribution $w$ with $w_1 + w_2 = 1$).

Proof. The result follows simply from the fact that the Geometric($p$) distribution is
stochastically strictly decreasing in $p$. □



Remark 4.2. (a) The possibility $w_1 + w_2 = 1$ is ruled out for MTF by our assumption
that all weights are positive. Nevertheless it is a limiting case. In this limiting case,
$T = n - 1$ with probability 1. At the other extreme of uniform weights, asymptotics
for $\mathcal{L}(T)$ are well known (since this is a slight modification of the standard coupon
collector's problem). (The distribution $\mathcal{L}(T)$ for uniform weights is treated in detail in
Theorem 4.3(a) and Section 4.2 of [?]. Roughly put, the distribution of $T$ is concentrated
tightly about $n \ln n$.) Thus, for any sequence $w^{(1)} = (w_{11})$, $w^{(2)} = (w_{21}, w_{22})$, ... of
weight vectors, writing $T = T_n$ for the $T$ corresponding to weight vector $w^{(n)}$, we have

    $P(T \ge n - 1) = 1$    and    $\liminf_{c \to \infty} \liminf_{n \to \infty} P(T \le n \ln n + cn) = 1$.

So the distribution of $T$ is always tightly sandwiched between $n$ and about $n \ln n$, in
sharp contrast (cf. Table 1 of [?] or Table 1 below) to the distribution of $T_{\mathrm{id}}$ or of $T_{\mathrm{CFTP}}$.
(b) According to Remark 2.6 and the sentence following (3.2) in [?], $\mathcal{L}(T_{\mathrm{CFTP}})$ is
strictly Schur-convex in $w$. In particular, the best-case $\mathcal{L}(T_{\mathrm{CFTP}})$, corresponding to
uniform weights, equals the worst-case $\mathcal{L}(T)$, also corresponding to uniform weights.



4.2 Specific examples of weight vectors 

We now measure quantitatively, for certain standard examples of weight vectors, the
speedup gained for FMMR by using the best choice of initial permutation, rev. Given
a triangular array of weights $w^{(n)} = (w_{ni}, i = 1, \ldots, n)$, $n \ge 1$, we say that $k_n$ steps are
necessary and sufficient for convergence of $\mathcal{L}(T)$ to mean that

    $T_n / k_n \to 1$ in probability.

Here $T_n$ denotes $T_{\mathrm{rev}}$ for the weight vector $w^{(n)}$; when there is no danger of confusion,
we will sometimes drop the subscript $n$.

For some examples of $w^{(n)}$ we can obtain results of sharper form than provided by
"$k_n$ steps are necessary and sufficient". However, for simplicity and for uniformity of
presentation, we stick to the above definition.

Let $H_n^{(\alpha)} := \sum_{i=1}^{n} i^{-\alpha}$ for $\alpha > 0$, and let $\zeta(\alpha) := \sum_{i=1}^{\infty} i^{-\alpha}$, $\alpha > 1$, denote the
Riemann zeta function. We consider the following choices of weights, where (now
suppressing dependence on $n$ in our notation) each weight vector $w$ is listed up to a constant
of proportionality. The numbers of steps necessary and sufficient for convergence of $\mathcal{L}(T)$
for these examples are stated in Theorem 4.3 and collected in Table 1. The second and
third columns of Table 1 are taken from [?]. [In these columns, the meaning of "$c k_n$
steps are necessary and sufficient" is that, for some $h$ and $H$,

    $h(c) \le \liminf P(T_n \le \lfloor c k_n \rfloor) \le \limsup P(T_n \le \lfloor c k_n \rfloor) \le H(c)$,

where $0 < h(c) \le H(c) < 1$ for all $c \in (0, \infty)$, $h(c) \to 0$ as $c \to 0$, and $H(c) \to 1$ as
$c \to \infty$.] The fourth column in Table 1 is the content of our next theorem.



    Weights                          $w_i \propto$
    Uniform                          $1$
    Zipf's law                       $i^{-1}$
    Generalized Zipf's law (GZL)     $i^{-\alpha}$, $\alpha > 0$ fixed
    Power law                        $(n - i + 1)^s$, $s > 0$ fixed
    Geometric                        $\theta^i$, $0 < \theta < 1$ fixed



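Table 1. Rates of convergence for $\mathcal{L}(T_{\mathrm{CFTP}})$ and $\mathcal{L}(T_{\mathrm{FMMR}})$.

    Weights               $\mathcal{L}(T_{\mathrm{CFTP}})$            $\mathcal{L}(T_{\mathrm{FMMR}})$ (worst)     $\mathcal{L}(T_{\mathrm{FMMR}})$ (best)
    Uniform               $n \ln n$                    $n \ln n$                    $n \ln n$
    Zipf's law            $n (\ln n)^2$                $n (\ln n)^2$                $n$
    GZL, $0 < \alpha < 1$       $n \ln n / (1 - \alpha)$         $n \ln n / (1 - \alpha)$         $n / \alpha$
    GZL, $\alpha > 1$           $\zeta(\alpha) n^{\alpha} \ln n$         $\zeta(\alpha) n^{\alpha} \ln n$         $n$
    Power law             ?                          ?                          $n \ln n / (s + 1)$
    Geometric             $c\,\theta^{-n}$                  $c\,\theta^{-n}$                  $n$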



Theorem 4.3. (a) (Uniform weights.) If $w_i = 1/n$ for all $i$, then $n \ln n$ steps are
necessary and sufficient.

(b) (Zipf's law.) If $w_i = (H_n i)^{-1}$, with $H_n := H_n^{(1)} = \sum_{i=1}^{n} i^{-1}$, then $n$ steps are
necessary and sufficient.

(c) (Generalized Zipf's law.) When $w_i = (i^{\alpha} H_n^{(\alpha)})^{-1}$, (i) if $0 < \alpha < 1$, then $n/\alpha$
steps are necessary and sufficient, and (ii) if $\alpha > 1$, then $n$ steps are necessary and
sufficient.

(d) (Power law.) Fix $s > 0$. If $w_i = (n - i + 1)^s / f(n, s)$, with $f(n, s) := \sum_{j=1}^{n} j^s$,
then $\frac{n \ln n}{s + 1}$ steps are necessary and sufficient.

(e) (Geometric weights.) Fix $0 < \theta < 1$. If $w_i = (1 - \theta)\theta^{i-1}$ for $i = 1, \ldots, n - 1$ and
$w_n = \theta^{n-1}$, then $n$ steps are necessary and sufficient.
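
The orders of magnitude in the theorem can be explored numerically through the mean of the running time from the reversal permutation, which is the sum of the reciprocals of the cumulative weights of the first $r$ records for $r = 2, \ldots, n$. The Python sketch below computes this mean for each weight family and divides by the claimed number of steps. It is only an illustration: the function names and the default parameter values are assumptions, and the ratios approach 1 slowly for some families (for Zipf's law the error is of order $1/\ln n$).

```python
import math

def expected_time_rev(w):
    """Mean running time from rev: sum over r = 2..n of 1 / (w_1 + ... + w_r)."""
    total, partial = 0.0, w[0]
    for r in range(1, len(w)):
        partial += w[r]
        total += 1.0 / partial
    return total

def normalized(raw):
    s = sum(raw)
    return [x / s for x in raw]

def families(n, alpha=0.5, s=1.0, theta=0.5):
    """Map family name -> (weight vector, claimed number of steps k_n)."""
    idx = range(1, n + 1)
    geometric = [(1 - theta) * theta ** (i - 1) for i in range(1, n)] + [theta ** (n - 1)]
    return {
        "uniform": (normalized([1.0] * n), n * math.log(n)),
        "zipf": (normalized([1.0 / i for i in idx]), float(n)),
        "gzl": (normalized([i ** -alpha for i in idx]),
                n / alpha if alpha < 1 else float(n)),
        "power": (normalized([(n - i + 1) ** s for i in idx]),
                  n * math.log(n) / (s + 1)),
        "geometric": (geometric, float(n)),
    }

def ratios(n, **params):
    return {name: expected_time_rev(w) / k
            for name, (w, k) in families(n, **params).items()}

if __name__ == "__main__":
    for name, ratio in ratios(1000).items():
        print(f"{name:10s} E[T] / k_n = {ratio:.3f}")
```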

Proof. We shall ignore the trivialities induced by the need to consider integer parts in 
various arguments, leaving these to the meticulous reader. 

(a) (Uniform weights.) The asymptotics here are well known, as this is essentially
the standard coupon collector's problem. A very sharp asymptotic result is that

    $P(T > \lfloor n \ln n + cn \rfloor) \to 1 - (1 + e^{-c}) e^{-e^{-c}}$,    $c \in \mathbf{R}$.
A thorough treatment of the uniform-weights case is provided by Diaconis et al. [?].

We establish the remaining results [as we could also have established (a)] by showing,
in each case, (i) that the number of steps $k_n$ claimed to be necessary and sufficient is the
lead order term of $E[T]$, that is, that $E[T_n] \sim k_n$ as $n \to \infty$, and (ii) that the standard
deviation of $T_n$ is $o(E[T_n])$. The result then follows by application of Chebyshev's
inequality. Showing (ii) for each of the weight examples is easy since

    $\mathrm{Var}[T] = \sum_{r=2}^{n} \frac{1 - w^+_r}{(w^+_r)^2} \le \frac{1}{w^+_2} \sum_{r=2}^{n} \frac{1}{w^+_r} - E[T] = E[T] \left( \frac{1}{w^+_2} - 1 \right)$,

and it is easy to check in each case that $1/w^+_2 = o(E[T])$. The remainder of the proof
thus consists of showing (i). In each case we give explicit upper and lower bounds for
$E[T]$.
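
For completeness, the Chebyshev step can be written out as follows. This is a sketch of the standard argument, assuming the variance bound $\mathrm{Var}[T] \le E[T](1/w^+_2 - 1)$ and the condition $1/w^+_2 = o(E[T])$; the tolerance $\delta > 0$ is an arbitrary constant introduced here for illustration.

```latex
\[
P\left( \left| \frac{T_n}{E[T_n]} - 1 \right| \ge \delta \right)
\le \frac{\mathrm{Var}[T_n]}{\delta^2 \, (E[T_n])^2}
\le \frac{1/w^+_2 - 1}{\delta^2 \, E[T_n]}
\longrightarrow 0 \qquad (n \to \infty).
\]
```

Since $E[T_n] \sim k_n$ by (i), it follows that $T_n / k_n \to 1$ in probability.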

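(b) (Zipf's law weights.) Here

(4.1)    $n - 1 \le E[T] = H_n \sum_{r=2}^{n} (H_r)^{-1} \le (\ln n + 1) \sum_{r=2}^{n} \frac{1}{\ln(r + 1)} \le (\ln n + 1) \int_2^{n+1} \frac{dx}{\ln x}$.

Observe that

    $\int_2^{n+1} \frac{dx}{\ln x} = \frac{n + 1}{\ln(n + 1)} - \frac{2}{\ln 2} + \int_2^{n+1} \frac{dx}{(\ln x)^2}$

and that

    $\int_2^{n/(\ln n)^2} \frac{dx}{(\ln x)^2} \le \frac{n/(\ln n)^2}{(\ln 2)^2}$

and

    $\int_{n/(\ln n)^2}^{n+1} \frac{dx}{(\ln x)^2} \le \frac{1}{\ln[n/(\ln n)^2]} \int_{n/(\ln n)^2}^{n+1} \frac{dx}{\ln x} \le \frac{1}{\ln[n/(\ln n)^2]} \int_2^{n+1} \frac{dx}{\ln x}$.

Thus,

    $\int_2^{n+1} \frac{dx}{\ln x} \le \frac{n + 1}{\ln(n + 1)} - \frac{2}{\ln 2} + \frac{n}{(\ln n)^2 (\ln 2)^2} + \frac{1}{\ln[n/(\ln n)^2]} \int_2^{n+1} \frac{dx}{\ln x}$,

i.e.,

    $\int_2^{n+1} \frac{dx}{\ln x} \le \left( 1 - \frac{1}{\ln[n/(\ln n)^2]} \right)^{-1} \left[ \frac{n + 1}{\ln(n + 1)} + \frac{n}{(\ln n)^2 (\ln 2)^2} - \frac{2}{\ln 2} \right] = \frac{n + 1}{\ln(n + 1)} + O\left( \frac{n}{(\ln n)^2} \right)$.

Continuing now from (4.1) we have

    $E[T] \le (\ln n + 1) \left[ \frac{n + 1}{\ln(n + 1)} + O\left( \frac{n}{(\ln n)^2} \right) \right] = n + O\left( \frac{n}{\ln n} \right)$.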

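(c) (Generalized Zipf's law.) For $0 < \alpha < 1$,

    $E[T] = \sum_{r=2}^{n} \frac{H_n^{(\alpha)}}{H_r^{(\alpha)}} \le (n^{1-\alpha} - \alpha) \sum_{r=2}^{n} \frac{1}{(r + 1)^{1-\alpha} - 1}$
    $\le n^{1-\alpha} \sum_{r=3}^{n+1} \frac{1}{r^{1-\alpha} - 1} = n^{1-\alpha} \left[ \sum_{r=3}^{n+1} r^{-(1-\alpha)} + \sum_{r=3}^{n+1} \frac{1}{r^{1-\alpha}(r^{1-\alpha} - 1)} \right]$
    $\le \frac{n}{\alpha} + O(c_n) = \frac{n}{\alpha} + o(n)$,

where $c_n$ is defined as $n^{1-\alpha}$ if $0 < \alpha < 1/2$, as $n^{1/2} \ln n$ if $\alpha = 1/2$, and as $n^{\alpha}$ if
$1/2 < \alpha < 1$. Also,

    $E[T] \ge ((n + 1)^{1-\alpha} - 1) \sum_{r=2}^{n} \frac{1}{r^{1-\alpha} - \alpha} \ge (1 + o(1))\, n^{1-\alpha} \cdot \frac{n^{\alpha}}{\alpha} = \frac{n}{\alpha} + o(n)$.

For $\alpha > 1$, we use the fact that

(4.2)    $H_n^{(\alpha)} = \zeta(\alpha) - \frac{n^{-(\alpha - 1)}}{\alpha - 1} + O(n^{-\alpha})$.

Now

    $\sum_{r=2}^{n} \frac{1}{H_r^{(\alpha)}} = \sum_{r=2}^{n} \frac{1}{\zeta(\alpha) - \frac{r^{-(\alpha-1)}}{\alpha - 1} + O(r^{-\alpha})}$
    $= \frac{1}{\zeta(\alpha)} \sum_{r=2}^{n} \left[ 1 - \frac{r^{-(\alpha-1)}}{(\alpha - 1)\zeta(\alpha)} + O(r^{-\alpha}) \right]^{-1}$
    $= \frac{1}{\zeta(\alpha)} \sum_{r=2}^{n} \left[ 1 + \frac{r^{-(\alpha-1)}}{(\alpha - 1)\zeta(\alpha)} + O(r^{-\alpha}) + O(r^{-(2\alpha-2)}) \right]$
    $= \frac{n}{\zeta(\alpha)} + o(n)$.

Together with (4.2) this gives

    $E[T] = H_n^{(\alpha)} \sum_{r=2}^{n} \frac{1}{H_r^{(\alpha)}} = \left[ \zeta(\alpha) + O(n^{-(\alpha-1)}) \right] \left[ \frac{n}{\zeta(\alpha)} + o(n) \right] = n + o(n)$.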


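(d) (Power law.) Here

    $E[T] = f(n, s) \sum_{r=2}^{n} \frac{1}{(n - r + 1)^s + \cdots + n^s} \le f(n, s)(s + 1) \sum_{r=2}^{n} \frac{1}{n^{s+1} - (n - r)^{s+1}} = \frac{f(n, s)(s + 1)}{n^{s+1}} \sum_{r=2}^{n} \frac{1}{1 - (1 - \frac{r}{n})^{s+1}}$.

The inequality follows from an integral comparison. Another integral comparison shows
that the last sum above is bounded above by

    $\int_1^{n} \frac{dx}{1 - (1 - \frac{x}{n})^{s+1}} = n \int_0^{1 - \frac{1}{n}} \frac{dy}{1 - y^{s+1}} =: n \times I$.

Now

    $I = \int_0^{1 - \frac{1}{n}} \sum_{k=0}^{\infty} y^{(s+1)k} \, dy = \sum_{k=0}^{\infty} \frac{(1 - \frac{1}{n})^{(s+1)k+1}}{(s + 1)k + 1}$
    $\le \left( 1 - \frac{1}{n} \right) + \frac{1 - \frac{1}{n}}{s + 1} \sum_{k=1}^{\infty} \frac{1}{k} \left( 1 - \frac{1}{n} \right)^{(s+1)k}$
    $= \left( 1 - \frac{1}{n} \right) - \frac{1 - \frac{1}{n}}{s + 1} \ln\left( 1 - \left( 1 - \frac{1}{n} \right)^{s+1} \right)$
    $\le 1 - \frac{1 - \frac{1}{n}}{s + 1} \ln\left( \frac{s + 1}{n} - \frac{(s + 1)s}{2n^2} \right)$
    $\le 1 + \left( 1 - \frac{1}{n} \right) \frac{\ln n}{s + 1}$
    $\le \frac{\ln n}{s + 1} + 1$,

where the penultimate inequality holds for all sufficiently large $n$ [in particular, for
$n > (s + 1)/2$]. We then have that

    $E[T] \le \frac{f(n, s)(s + 1)}{n^s} \left( \frac{\ln n}{s + 1} + 1 \right) \le \frac{n \ln n}{s + 1} + n + \ln n + s + 1 = \frac{n \ln n}{s + 1} + O(n) = (1 + o(1)) \frac{n \ln n}{s + 1}$,

using

    $f(n, s) \le \frac{n^{s+1}}{s + 1} + n^s$.

For the lower bound,

    $E[T] = f(n, s) \sum_{r=2}^{n} \frac{1}{(n - r + 1)^s + \cdots + n^s} \ge f(n, s)(s + 1) \sum_{r=2}^{n} \frac{1}{(n + 1)^{s+1} - (n + 1 - r)^{s+1}} = \frac{f(n, s)(s + 1)}{(n + 1)^{s+1}} \sum_{r=2}^{n} \left[ 1 - \left( 1 - \frac{r}{n + 1} \right)^{s+1} \right]^{-1}$.

But

    $\sum_{r=2}^{n} \left[ 1 - \left( 1 - \frac{r}{n + 1} \right)^{s+1} \right]^{-1} \ge \int_2^{n+1} \frac{dx}{1 - (1 - \frac{x}{n+1})^{s+1}} = (n + 1) \int_0^{1 - \frac{2}{n+1}} \frac{dy}{1 - y^{s+1}}$.

We can show, but omit the details, that

    $\int_0^{1 - \frac{2}{n+1}} \frac{dy}{1 - y^{s+1}} \ge \frac{\ln n}{s + 1} - O(1)$,

so

    $E[T] \ge \left( \frac{n}{n + 1} \right)^{s+1} \left( \frac{(n + 1) \ln n}{s + 1} - O(n) \right) = \frac{n \ln n}{s + 1} - O(n)$,

where we have used

    $f(n, s) \ge \frac{n^{s+1}}{s + 1}$.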



(e) (Geometric weights.) We have

$$n - 1 \le E[T] = \sum_{r=2}^{n} \frac{1 - \theta^{n}}{1 - \theta^{r}} \le \sum_{r=2}^{n} \frac{1}{1 - \theta^{r}} = n + O(1) = (1 + o(1))\,n.$$

□
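
As a numerical sanity check on the asymptotics in this proof, the following Python sketch evaluates the sum $\sum_{r=2}^{n} 1/(w_{(1)} + \cdots + w_{(r)})$, where $w_{(1)} \ge w_{(2)} \ge \cdots$ are the normalized weights in decreasing order. That this sum equals $E[T]$ is an assumption read off from the displays for parts (c) to (e); the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_steps(weights):
    """Sum over r = 2..n of 1 / (normalized mass of the r largest weights).

    Assumption: this is the form of E[T] used in parts (c)-(e) above.
    The weights need not be normalized; the total is divided out here.
    """
    w = sorted(weights, reverse=True)
    total = sum(w)
    partial = w[0]            # mass of the largest weight
    result = 0.0
    for r in range(2, len(w) + 1):
        partial += w[r - 1]   # mass of the r largest weights
        result += total / partial
    return result


def power_law_weights(n, s):
    """Unnormalized power-law weights i**s, i = 1..n (illustrative)."""
    return [float(i) ** s for i in range(1, n + 1)]


def geometric_weights(n, theta):
    """Unnormalized geometric weights theta**i, i = 0..n-1 (illustrative)."""
    return [theta ** i for i in range(n)]


if __name__ == "__main__":
    n = 1000
    print("geometric:", expected_steps(geometric_weights(n, 0.5)), "vs n =", n)
    print("power law (s=1):", expected_steps(power_law_weights(n, 1)),
          "vs n ln n / 2 =", n * math.log(n) / 2)
```

For geometric weights the computed value should stay within a bounded distance of $n$, while for power-law weights it should track $n \ln n/(s+1)$ only up to an additive term of order $n$, so the ratio converges slowly.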



5 Coalescence into a set 

Here we present a different approach to speeding up the FMMR algorithm, which has 
the same spirit as the other results in this paper. In brief, recall that FMMR starts 
in a user-chosen state and then, in the second phase of the algorithm, checks whether 
there is coalescence to that state. In the generalization we consider here, one starts
the algorithm in some subset of the state space (not necessarily a singleton) and then 
checks if there is coalescence back to that set (but not necessarily to the state in which 
the algorithm began). 



Theorem 5.1. In the general setting for FMMR described in Section 2.3, let $S$ be a
subset of the state space and define $\pi_0(\cdot) := \pi(\cdot \mid S)$. Consider the modified algorithm
which starts in a state $X_0$ distributed according to $\pi_0$ and outputs $W := X_{-T}$, where $T$ is
defined to be the smallest $t$ such that all the forward trajectories from time $-t$ coalesce
into $S$, i.e., such that $Y^{(-t)}(x) \in S$ for every state $x$. Then $W$ has the stationary
distribution $\pi$. Further, the algorithm is interruptible (i.e., $T$ and $W$ are independent
random variables).

Proof. For simplicity we consider only the discrete case. It suffices to show that

(5.1) $P(T \le t,\ X_{-t} = x) = P(T \le t)\,\pi(x)$

for every $t$ and $x$, for then

$$\begin{aligned}
P(T = t,\ W = x) &= P(T = t,\ X_{-T} = x) = P(T = t,\ X_{-t} = x) \\
&= P(T \le t,\ X_{-t} = x) - P(T \le t-1,\ X_{-t} = x) \\
&= P(T \le t)\,\pi(x) - P(T \le t-1)\,\pi(x) \\
&= P(T = t)\,\pi(x),
\end{aligned}$$

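as desired. Here we have used the fact that $\pi$ is stationary for the time-reversed kernel $\widetilde{K}$,
so that

$$\begin{aligned}
P(T \le t-1,\ X_{-t} = x)
&= \sum_{y} P(T \le t-1,\ X_{-(t-1)} = y)\, P(X_{-t} = x \mid T \le t-1,\ X_{-(t-1)} = y) \\
&= \sum_{y} P(T \le t-1)\,\pi(y)\,\widetilde{K}(y, x) \quad \text{by (5.1) and the Markov property for } \widetilde{K} \\
&= P(T \le t-1)\,\pi(x).
\end{aligned}$$
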
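To establish (5.1), we first observe that

$$\begin{aligned}
P(X_{-t} = x) &= \sum_{z} \pi_0(z)\,\widetilde{K}^{t}(z, x) = \pi(x) \sum_{z} \frac{\pi_0(z)}{\pi(z)}\,K^{t}(x, z) \\
&= \frac{\pi(x)}{\pi(S)} \sum_{z \in S} K^{t}(x, z) = \frac{\pi(x)}{\pi(S)}\,K^{t}(x, S). \qquad (5.2)
\end{aligned}$$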

One can check that, conditionally given $X_{-t} = x$, the forward trajectory $(X_{-t}, \ldots, X_0)$
has the same distribution as a $K$-trajectory conditioned to start at $x$ and end in $S$.
Therefore, by the algorithm's design,

$$P(T \le t \mid X_{-t} = x) = P(\text{forward coalescence into } S \text{ over a time-interval of length } t) \,/\, K^{t}(x, S).$$

Combining this with (5.2) we conclude (5.1), and the additional result

$$P(T \le t) = P(\text{forward coalescence into } S \text{ over a time-interval of length } t) \,/\, \pi(S).$$

□ 
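
The identity (5.2) is purely algebraic once $\pi$ and the reversal are fixed, so it can be checked numerically on a small example. The sketch below does this in plain Python for a three-state kernel; the kernel, the subset $S$, and all function names are hypothetical choices made for illustration, and $\pi$ is approximated by power iteration rather than computed exactly.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(a, t):
    """t-th power of a square matrix (t >= 0) by repeated multiplication."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, a)
    return result


def stationary(k, iterations=2000):
    """Approximate stationary distribution of kernel k by power iteration."""
    n = len(k)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * k[i][j] for i in range(n)) for j in range(n)]
    return pi


def check_identity(k, subset, t):
    """Return both sides of (5.2), as lists indexed by the state x."""
    n = len(k)
    pi = stationary(k)
    pi_s = sum(pi[z] for z in subset)
    # Time reversal: K~(x, y) = pi(y) K(y, x) / pi(x).
    k_rev = [[pi[y] * k[y][x] / pi[x] for y in range(n)] for x in range(n)]
    k_rev_t, k_t = mat_pow(k_rev, t), mat_pow(k, t)
    # Left side: law of X_{-t} when X_0 ~ pi(. | S) and we run K~ for t steps.
    lhs = [sum(pi[z] / pi_s * k_rev_t[z][x] for z in subset) for x in range(n)]
    # Right side: pi(x) K^t(x, S) / pi(S).
    rhs = [pi[x] / pi_s * sum(k_t[x][z] for z in subset) for x in range(n)]
    return lhs, rhs


if __name__ == "__main__":
    kernel = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]]
    print(check_identity(kernel, subset=[0, 1], t=3))
```

Both lists should agree up to floating-point error, and the left side should sum to 1 because it is the law of $X_{-t}$.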

Remark 5.2. In the monotone case, if $S$ is a down-set (meaning: $z \in S$ and $y \le z$
imply $y \in S$), then the computational problem of determining whether or not there is
coalescence into $S$ is eased considerably: we need only determine whether the terminal
state (call it $y$) of the forward trajectory started in $\hat{1}$ belongs to $S$. And so if $S$ is a
principal down-set, that is, if $S = \{z : z \le z_0\}$ for some $z_0$, the problem is even easier:
we need only check whether $y \le z_0$.

We will now give a "toy" application of these ideas to MTF by describing an algorithm
to build up a stationary observation in just $n - 1$ steps, regardless of the weights
$w_1, \ldots, w_n$. Let $\pi_k$ denote the MTF stationary distribution on $\mathcal{S}_k$ restricted to the
(normalized) weights $w_1, \ldots, w_k$; that is, to the weights $w_i / (w_1 + \cdots + w_k)$, $i = 1, \ldots, k$.
Let MTF$_k$ denote the MTF process on $\mathcal{S}_k$, and let $S_k$ denote the set of permutations of
$\{1, \ldots, k\}$ that begin with $k$. Observe that $S_k$ is the principal order ideal $\{z : z \le z_k\}$ in the dual
(i.e., "upside-down") Bruhat order of $\mathcal{S}_k$, where $z_k$ is the permutation $(k\ 1\ \cdots\ k-1)$.
(We will not refer to the symmetric group $\mathcal{S}_k$ any further; thus there will be no notational
confusion with its special subset $S_k$.) Inductively, after $k$ steps of our algorithm
we will have a permutation distributed according to $\pi_{k+1}$; thus, after $n - 1$ steps we will
have an observation from $\pi$.

Initialize the algorithm (step 0) with the permutation $(1)$ on $\{1\}$. Suppose that after
$k - 1$ steps we have the permutation $x = (x_1, \ldots, x_k)$ distributed according to $\pi_k$. For
the next ($k$th) step, we first get immediately an observation from $\pi_{k+1}(\cdot \mid S_{k+1})$, namely,
$(k+1, x_1, \ldots, x_k)$. Then we apply the "coalesce into $S$" routine of Theorem 5.1, taking
$S = S_{k+1}$. We claim that that routine will terminate in a single step! Indeed, in one
time-reversed MTF$_{k+1}$ transition we obtain the permutation

$$x' = (x_1, \ldots, x_{j-1}, k+1, x_j, \ldots, x_k)$$

for some $j$. In the forward phase of the routine, record $k+1$ is brought to the front of every
trajectory, giving coalescence into the set $S_{k+1}$. We thus conclude from Theorem 5.1
that $x' \sim \pi_{k+1}$, completing the induction.
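
The inductive construction above can be sketched in Python as follows. This is a minimal illustration under two stated assumptions, not a definitive implementation: it uses the standard product form of the MTF stationary distribution, and it uses the fact that every candidate $x'$ returns to $(k+1, x_1, \ldots, x_k)$ by requesting record $k+1$, so that the time-reversed transition picks the insertion position with probability proportional to $\pi_{k+1}(x')$. All function names are hypothetical.

```python
import random
from itertools import permutations


def mtf_stationary(perm, weights):
    """Stationary MTF probability of `perm` (a tuple of 1-based records),
    using only the weights of the records in `perm`, renormalized."""
    remaining = sum(weights[i - 1] for i in perm)
    prob = 1.0
    for item in perm:
        prob *= weights[item - 1] / remaining
        remaining -= weights[item - 1]
    return prob


def insertion_probs(x, weights):
    """One time-reversed MTF step started from (k+1, x_1, ..., x_k).

    Returns the candidate permutations, obtained by inserting record k+1
    at each position of x, together with their probabilities (assumed
    proportional to the stationary probability of each candidate).
    """
    new_item = len(x) + 1
    candidates = [x[:j] + (new_item,) + x[j:] for j in range(len(x) + 1)]
    masses = [mtf_stationary(c, weights) for c in candidates]
    total = sum(masses)
    return candidates, [m / total for m in masses]


def sample_mtf_stationary(weights, rng=random):
    """Build a permutation of 1..n in n - 1 insertion steps."""
    x = (1,)
    for _ in range(len(weights) - 1):
        candidates, probs = insertion_probs(x, weights)
        x = rng.choices(candidates, weights=probs, k=1)[0]
    return x


if __name__ == "__main__":
    print(sample_mtf_stationary([0.5, 0.3, 0.2], random.Random(0)))
```

Each step here recomputes the stationary probabilities from scratch, which costs on the order of $k^2$ arithmetic operations; the count of $n - 1$ steps in the text refers to Markov chain steps, not to arithmetic cost.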